Spark Execution Model

The Spark execution model can be described in three phases:
  • creating the logical plan
  • translating that into a physical plan
  • executing the tasks on a cluster

You can view useful information about your Spark jobs in real time in a web browser at http://<driver-node>:4040. For Spark applications that have finished, the Spark history server makes the same information available at http://<server-url>:18080. Let’s walk through the three phases and the Spark UI information about each phase, with some example code.
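
As a running example, here is a minimal sketch in Scala (run locally; the app name and the data are illustrative choices, not anything prescribed by Spark):

    import org.apache.spark.sql.SparkSession

    // Entry point to the DataFrame/Dataset API.
    val spark = SparkSession.builder
      .appName("execution-model-demo")
      .master("local[*]")   // run locally for illustration
      .getOrCreate()
    import spark.implicits._

    // spark.range produces a Dataset with a single column named "id".
    val ds = spark.range(0, 1000000)

    // Transformations are lazy: each call returns a new Dataset and records
    // lineage, but nothing runs on the cluster yet.
    val evens = ds.filter($"id" % 2 === 0)

    // An action (count) triggers planning and execution of a job.
    println(evens.count())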

The Logical Plan
In the first phase, the logical plan is created. This is the plan that shows which steps will be executed when an action is applied. Recall that when you apply a transformation to a Dataset, a new Dataset is created. The new Dataset points back to its parent, so the chain of transformations forms a lineage, a directed acyclic graph (DAG), describing how Spark will execute those transformations.
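
Continuing the sketch above, you can inspect the lineage Spark has recorded without triggering any execution; explain(true) prints the parsed, analyzed, and optimized logical plans (the column name "tens" is just an illustrative label):

    // Each transformation returns a new Dataset whose plan points back to
    // its parent, so chaining transformations builds up the DAG (lineage).
    val scaled = evens.select(($"id" * 10).as("tens"))

    // explain(true) prints the parsed, analyzed, and optimized logical
    // plans (followed by the physical plan) without running a job.
    scaled.explain(true)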

The Physical Plan
Actions trigger the translation of the logical DAG into a physical execution plan. The Spark Catalyst query optimizer creates the physical execution plan for DataFrames, as shown in the diagram below:
(Diagram: the Catalyst query optimization pipeline. Image reference: Databricks.)

The physical plan identifies the resources that will execute the plan, such as memory partitions and the compute tasks that operate on them.
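
Continuing the same sketch, calling explain with no arguments prints only the physical plan that Catalyst selected, and the partition count shows how many tasks each stage will run (the exact operator names in the output vary across Spark versions):

    // Prints just the selected physical plan, typically a WholeStageCodegen
    // block containing operators such as Range, Filter, and Project.
    scaled.explain()

    // At execution time, the physical plan runs as one task per partition.
    println(s"partitions: ${scaled.rdd.getNumPartitions}")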
